0. Objectives
  1. Introduction
  2. Clustering Analysis
    2-1. Hierarchical clustering
    2-2. Principal Component Analysis
    2-3. k-means Clustering Analysis
    2-4. Self-Organizing Map (SOM)
  3. eQTL analysis resources
    3-1. GTEx eQTL Browser

0. Objectives

We want to know gene-phenotype correlations. This is the oldest and most central topic of genetics research. But in most cases it’s difficult to establish a one-to-one relationship between gene and phenotype, except for some rare genetic disorders that manifest in early childhood. However, transcript expression, although not visible to the naked eye, can be seen as a form of molecular phenotype for which a correlation with DNA variants can be established.

eQTL does just that. It describes the correlation between DNA variants and transcript expression level.

The following figure is a simplified overview of eQTL.

An eQTL itself means a genomic region that contains sequence variants affecting another gene’s expression level. It’s similar to QTL analysis in that it links phenotypic data with genotype data and looks for correlations between them. But in the case of eQTL, the observed trait is the expression level rather than a classical phenotype.

Well, the thing is, expression level is a mediating variable between genotype and phenotype, so it can help us gain insight and new interpretations in genomic data analysis. In this exercise, we learn various methods used in clustering analysis and use various eQTL analytical tools to reach an interpretation and conclusion of our own.

1. Introduction

Microarray data analysis proceeds through: 1) preprocessing, 2) normalization, 3) differentially expressed gene detection, 4) gene expression pattern analysis, 5) clustering analysis, 6) classification analysis, 7) biological annotation of the analyzed results, and 8) functional inference.

Clustering analysis can be viewed as a type of unsupervised machine learning for pattern analysis or similarity structure analysis. Simply put, it’s the most primitive form of learning: cluster like things together, keep unlike things apart. This approach is based on the expectation that the resulting clusters show specific patterns that can be distinguished from one another.

Clustering can be divided into two big categories:

  • Hierarchical clustering

  • Partitional clustering

There are also variations like SOM (or SOFM, Self-Organizing Feature Map), and dimensionality reduction algorithms like PCA (principal component analysis) are often grouped with clustering analysis as well.

From QTL to eQTL

The QTL (quantitative trait loci) method from traditional statistical genetics has evolved into the eQTL (expression QTL) analytic method. A QTL is a region of the genome that shows correlation with variation in some quantitative phenotypic trait; so the good ole’ days’ Mendelian genetics explored the relationship between genotype and phenotype directly. eQTL analysis takes into account the plausible assumption that regulation of transcript expression mediates between genotype and phenotype. This was to address the problem that genotype alone cannot fully explain the genotype-phenotype relationship; it clearly wasn’t the full picture. It moved past the common notion that a certain genotype determines a specific phenotype, recognizing that there are indeed cases where a genotype affects the expression levels of other genes, and that such regulation of expression in turn manifests itself as a phenotype through complex interactions. So the expression level of a gene is defined as an eQT (expression quantitative trait), and we look for the genomic loci correlated with variation in this trait. Omics-level DNA and RNA analysis enabled us to analyze the correlation between an eQT and a DNA locus (L): the eQT-L correlation. And this, ladies and gentlemen, is eQTL.

Some even call the expression level an intermediate (mediating) phenotype, or a molecular phenotype. There have been accounts where candidate genomic regions discovered through GWAS were not in protein-coding regions, which made biological functional annotation difficult to infer. It’s quite possible that some of them are actually eQTLs that regulate the expression levels of other genes. So eQTL analysis begins with the simple assumption that variants in non-coding regions are probably related to gene expression regulation. The mechanism of action of an eQTL can be either cis-acting or trans-acting. A cis-eQTL is a nearby genetic variant, in a region like a promoter, that directly affects the gene’s expression. A trans-eQTL is a variant whose product acts as an intermediary that indirectly regulates the expression of a second gene. Although it can be confusing, the terms cis and trans do not simply classify by physical distance, as “local” and “distant” would.
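The core eQTL test can be made concrete with a toy example: regress a gene’s expression level on genotype dosage (0/1/2 copies of the alternate allele). This is a minimal sketch with simulated data; the variable names and effect size are illustrative, not from the exercise data.

```r
set.seed(1)
# Simulated genotype dosages (0, 1, 2 alternate alleles) for 100 individuals
geno <- sample(0:2, 100, replace = TRUE, prob = c(0.49, 0.42, 0.09))
# Simulated expression level: additive genotype effect plus noise
expr <- 0.8 * geno + rnorm(100)
# A simple eQTL association test: linear regression of expression on genotype
fit <- lm(expr ~ geno)
summary(fit)$coefficients["geno", ]  # effect size, SE, t statistic, p-value
```

A significant slope here would mark this variant-gene pair as a candidate eQTL; real pipelines repeat this over millions of pairs with multiple-testing correction.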

So here’s the overall workflow of eQTL analysis exercise.

We have 10 breast cancer samples and 10 normal samples, profiled on an Agilent expression microarray. The sample data are microarray gene expression data of breast cancer: 10 samples for the patient group (BC1~BC10) and 10 samples for the normal control group (NB1~NB10). There are 159 genes in total, and the data were downloaded from the TCGA breast cancer dataset. The platform is Agilent Expression 244K microarrays, and the data were preprocessed as follows: 1) Lowess normalization, 2) gene-level summarization (averaging probe expression values), 3) t-test (patient group vs. normal group), 4) differentially expressed gene selection (FDR-adjusted p-value < 0.005). In other words, t-tests were run over multiple genes and the differentially expressed genes (159 genes, to be exact) were identified at an adjusted p-value of less than 0.005. This data is passed on to clustering analysis and eQTL resources. In clustering analysis, samples that share certain expression features are clustered by hierarchical clustering and PCA. Likewise, genes are clustered by the k-means algorithm, which partitions a given dataset into k clusters. It’s a form of unsupervised machine learning that operates by minimizing the within-cluster dispersion.

In this exercise, we use the packages som and ggplot2. Install the packages with the following code.

Since som wasn’t present on Bioconductor, I downloaded it from CRAN.
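A minimal install snippet (both packages are on CRAN; the guard skips packages that are already installed, and the mirror URL is an assumption you can change):

```r
# som is on CRAN (not Bioconductor); ggplot2 is also on CRAN
pkgs <- c("som", "ggplot2")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0)
  install.packages(missing, repos = "https://cloud.r-project.org")
```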

Before using som, let’s learn what the “Self-Organizing Map” is all about. Here’s a nicely written blog article on the subject. I will paraphrase this.

  1. First, the algorithm randomly positions the grid’s neurons in the data space.

  2. Selects one data point, either randomly or by systematically cycling through the dataset in order.

  3. Finds the neuron closest to the chosen data point. This neuron is called the “Best matching unit” (BMU).

  4. Now, move the BMU closer to that data point. The distance moved by the BMU is determined by a certain “learning rate”, which decreases after each iteration.

  5. Also, the algorithm moves the BMU’s neighbors closer to that data point as well, with farther-away neighbors moving less.

  6. Neighbors are identified using a radius around the BMU, and the value for this radius decreases after each iteration.

  7. Update the learning rate and BMU radius, and repeat steps 2~6. Iterate until the positions of the neurons have stabilized.


So overall, the grid takes a distorted form with its unit neurons getting densely concentrated around actual clusters.
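The update rule described above can be sketched in a few lines of base R. This is a toy illustration of the BMU update only (the neighborhood and radius updates from steps 5~6 are omitted for brevity); it is not the som package’s actual implementation.

```r
set.seed(42)
X  <- matrix(rnorm(200), ncol = 2)  # toy data: 100 points in 2-D
W  <- matrix(rnorm(12), ncol = 2)   # 6 grid neurons, randomly initialized
lr <- 0.5                           # initial learning rate

for (iter in 1:100) {
  x   <- X[sample(nrow(X), 1), ]    # step 2: pick one data point
  d2  <- rowSums((W - matrix(x, nrow(W), 2, byrow = TRUE))^2)
  bmu <- which.min(d2)              # step 3: best matching unit
  W[bmu, ] <- W[bmu, ] + lr * (x - W[bmu, ])  # step 4: move BMU toward point
  lr <- lr * 0.97                   # step 7: decay the learning rate
}
```

After enough iterations the neuron positions in W settle near dense regions of the data, which is the distortion toward actual clusters described above.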

Load the libraries.

library(som)
library(ggplot2)

The expression profiles of 159 genes across 20 people’s samples are stored in ~/eQTL/breast.txt. Each row corresponds to a gene and each column corresponds to a sample; the elements contain expression levels. Now load the data using read.delim, which is specifically for reading tab-delimited files into tables.

getwd()
## [1] "/home/seungho/GDA/eQTL"
brc=read.delim("breast.txt", header=TRUE, row.names=1)
brc= as.data.frame(brc)
head(brc)
##                 BC1        BC2          BC3         BC4       BC5        BC6
## MMP11     4.2066667  3.5826667  1.415600000  2.25550000  1.756000  3.3433330
## CDCA2    -2.1091667 -1.6561667 -2.623000000 -1.63600000 -3.151333 -2.7011670
## ASPM     -1.7484545 -2.7056429 -2.551785714 -1.71257143 -3.474769 -1.9912140
## TMEM88   -0.2888333 -0.5121667 -0.008333333  0.01033333  0.458500  0.4891667
## KIAA0101 -1.7595000 -1.4788000 -1.332100000 -1.42540000 -3.424600 -2.1232000
## ZDHHC9    0.3760000  0.6685000 -0.022500000  0.63900000  0.544000  0.8480000
##                BC7        BC8        BC9      BC10        NB1        NB2
## MMP11     1.394833  2.9648330  3.7255000  2.407333 -0.7341667 -0.4866667
## CDCA2    -2.021167 -2.6096670 -2.8086667 -1.380667 -3.6373333 -3.4475000
## ASPM     -1.974786 -2.3585710 -2.2496154 -1.349071 -3.7272143 -4.1337860
## TMEM88    0.870500  0.1896667  0.4393333  0.088000  1.7673333  1.5371670
## KIAA0101 -1.501900 -1.4268000 -1.6194000 -1.846900 -3.3982222 -3.6480000
## ZDHHC9    0.278500  0.8340000  0.5840000  1.016000 -0.0765000  0.1660000
##                 NB3        NB4       NB5        NB6        NB7        NB8
## MMP11    -0.6361667 -0.9161667  1.865000 -0.8198333 -0.1163333  0.1826667
## CDCA2    -4.1261667 -4.2118330 -3.456167 -3.4940000 -3.7546667 -2.9301667
## ASPM     -4.1231429 -4.5689290 -4.304750 -4.2784290 -3.4375000 -2.4461429
## TMEM88    1.2563333  1.5588330  1.798000  1.1176670  1.5915000  2.3540000
## KIAA0101 -4.3877000 -4.3592000 -3.280700 -3.8828890 -3.7111250 -1.8986667
## ZDHHC9   -0.2240000 -0.2230000 -0.927000 -0.5365000 -0.5605000 -0.4435000
##                NB9      NB10
## MMP11    -1.070000 -0.734000
## CDCA2    -3.994833 -3.899667
## ASPM     -5.107429 -4.921643
## TMEM88    2.695833  0.548000
## KIAA0101 -4.711125 -3.890600
## ZDHHC9    0.163500 -0.216000

2. Clustering Analysis

Microarray data has two axes, samples and genes, so clustering can occur both ways: sample clustering and gene clustering. Gene clustering groups together genes that have similar expression patterns over all the samples; genes that cluster together are tightly co-regulated under the sampled conditions and often share functionality. We can also cluster samples into groups that have similar expression patterns; for example, the exercise data may conveniently and expectedly split into two clusters, the 10 breast cancer patients and the 10 normal controls. It’s also possible to hierarchically cluster along both axes, by gene and by sample.

2-1. Hierarchical clustering

Hierarchical clustering is the most widely used method in microarray clustering analysis. There are two types: divisive and agglomerative. Usually agglomerative hierarchical clustering is used, in which the similarity between individual subjects is measured and the most similar pair is merged, then the next, and so on upward. Hierarchical clustering eventually joins the entire dataset into one big hierarchical tree, or dendrogram, which means we don’t have to select a target cluster number as in k-means or SOM. But we still have to pick a threshold at which to cut the tree into clusters.

Next is a simple example of performing a hierarchical clustering analysis. Various parameters can be set; the most important are the distance function and the linkage method.

Distance functions:

  • euclidean distance
  • pearson correlation coefficient (linear correlation between two variables)
  • vector angle

Different distance functions yield different hierarchical clusterings. Selecting a distance function that fits the purpose of a particular clustering analysis model is actually a very important problem, but it’s a bit beyond the scope of this practice, so we’ll not cover this topic in depth here; refer to this article. The default distance measure is Euclidean distance. Correlation-based distance is often used in gene expression data analysis: it considers two objects similar if their features are highly correlated, even when the observed values are far apart in terms of Euclidean distance.
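For instance, a correlation-based distance for genes can be built by hand with base R (a small sketch on a hypothetical toy matrix, not a required step of this exercise):

```r
# Hypothetical small expression matrix: rows = genes, cols = samples
m <- matrix(rnorm(40), nrow = 4,
            dimnames = list(paste0("g", 1:4), paste0("s", 1:10)))
# cor() correlates columns, so transpose to correlate genes (rows)
d_cor <- as.dist(1 - cor(t(m)))  # correlation-based distance
d_euc <- dist(m)                 # Euclidean distance, the dist() default
hc <- hclust(d_cor)              # the downstream call is the same either way
```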

Agglomerative clustering can further be divided by linkage: single, complete, average, and more. Complete linkage usually results in better performance, but more computation is required. In this practice we use hclust’s default, complete-linkage agglomerative hierarchical clustering. See: Understanding the Concept of Hierarchical Clustering Technique.

dst = dist(t(brc)) # Compute distance matrix. It's included in stats.
hc = hclust(dst) # Hierarchical clustering performed here. Also in stats.
plot(hc) # Draw the dendrogram

sc = c(rep("skyblue", 10), rep("pink", 10)) # Sample color settings
heatmap(as.matrix(brc), col=cm.colors(256), ColSideColors=sc) # Draw heatmap.
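To turn the dendrogram into a fixed set of groups (the cut threshold mentioned earlier), cutree from stats can cut the tree either at a chosen number of clusters or at a height. A self-contained sketch with toy data standing in for brc:

```r
set.seed(7)
toy <- matrix(rnorm(100), nrow = 20)        # 20 toy "samples"
hc  <- hclust(dist(toy))                    # same calls as in the exercise
grp <- cutree(hc, k = 2)                    # cut the dendrogram into 2 groups
table(grp)                                  # group sizes
# or cut at a height threshold instead of a fixed k:
grp_h <- cutree(hc, h = max(hc$height) / 2)
```

With the real data, `cutree(hclust(dist(t(brc))), k = 2)` would test whether the dendrogram recovers the patient/normal split.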

2-2. Principal Component Analysis (PCA)

Diverse mapping and dimensionality reduction techniques, including PCA, can also be used for clustering. PCA is a dimension reduction technique that finds the principal component axes that best represent the data while minimizing information loss. In genomic data analysis, a PCA axis is composed of linearly combined information from many genes, which is why it’s also called a “supergene” or “metagene”: it goes beyond the dimension of any single gene.

Perform PCA with the following code block. PCA also has various parameter settings that greatly affect the outcome, but that is beyond the scope and purpose of this practice, so we will not discuss it in detail.

p=prcomp(t(brc)) # Perform PCA

sc = c(rep("slateblue", 10), rep("hotpink", 10)) # Choose sample color
la = c(rep("C", 10), rep("N", 10)) # Choose sample names

plot(p$x[,1:2], col=sc, pch=8) # Draw the result plot of PCA analysis.

# Only take the PC1 and PC2.

p$x[,1:2]
##             PC1        PC2
## BC1  -16.908535  0.8978444
## BC2  -16.114723 -2.0015157
## BC3  -19.414552  0.3805945
## BC4  -15.316779  0.8517747
## BC5   -3.596877 -2.2965054
## BC6   -9.308695 -2.7698020
## BC7   -4.825497 -1.2473441
## BC8  -15.261241 -1.3985995
## BC9  -18.195990 -2.6287455
## BC10 -17.919862  3.1176076
## NB1   11.696798  2.9042919
## NB2   13.723050  1.1911008
## NB3   16.453000 -1.9473951
## NB4   19.145211 -4.0321514
## NB5   14.436177 -0.4071183
## NB6    6.060183  1.0581736
## NB7   18.317468  1.9470519
## NB8    5.640119 11.6303669
## NB9   21.936732 -3.4652929
## NB10   9.454012 -1.7843363

plot(p$x[,1:2], col=sc, pch=8) # Draw the result plot of PCA analysis.
# Only take the PC1 and PC2.
text(x=p$x[,1], y=p$x[,2], labels=la, adj=-0.5)

summary(p)
## Importance of components:
##                            PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     15.0622 3.43518 3.06786 2.73188 2.1508 2.08678 1.71096
## Proportion of Variance  0.7946 0.04133 0.03296 0.02614 0.0162 0.01525 0.01025
## Cumulative Proportion   0.7946 0.83591 0.86888 0.89501 0.9112 0.92647 0.93672
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.64618 1.52002 1.4733 1.38106 1.26834 1.23081 1.15672
## Proportion of Variance 0.00949 0.00809 0.0076 0.00668 0.00563 0.00531 0.00469
## Cumulative Proportion  0.94621 0.95430 0.9619 0.96859 0.97422 0.97953 0.98421
##                           PC15    PC16    PC17    PC18    PC19      PC20
## Standard deviation     1.07904 1.01909 0.97974 0.86651 0.77090 2.225e-15
## Proportion of Variance 0.00408 0.00364 0.00336 0.00263 0.00208 0.000e+00
## Cumulative Proportion  0.98829 0.99193 0.99529 0.99792 1.00000 1.000e+00
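Since each PC is a linear combination of genes (the “supergene” idea above), the loadings in p$rotation show which genes drive an axis. A self-contained sketch with toy data; with the real data you would inspect `p$rotation` from `prcomp(t(brc))` the same way:

```r
set.seed(3)
toy <- matrix(rnorm(20 * 50), nrow = 20)    # 20 samples x 50 toy "genes"
colnames(toy) <- paste0("gene", 1:50)
p <- prcomp(toy)                            # samples in rows, as with t(brc)
# Genes with the largest absolute PC1 loadings contribute most to that axis
head(sort(abs(p$rotation[, "PC1"]), decreasing = TRUE))
# Proportion of variance explained per PC (matches what summary(p) reports)
pve <- p$sdev^2 / sum(p$sdev^2)
round(pve[1:3], 3)
```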

2-3. k-means Clustering Analysis

k-means clustering analysis is one of the oldest clustering techniques. Numerous, actually thousands of improved clustering methods have since been proposed, but k-means is still one of the favorite and most powerful techniques in big data analysis because of its simplicity and the computational speed that follows from it. The required parameter in k-means is, of course, k, which designates how many groups you’re going to cluster your data into. k-means first randomly selects that number of centroids in the data and gradually approaches the optimal partition.

So in the following figure: 1) three means (k=3) are randomly initialized in the data; 2) the algorithm divides the space into three parts, assigning each data point to its closest mean, which yields candidate clusters (expectation); 3) the mean of each candidate cluster is computed and taken as the new mean (maximization); 4) the space is again divided into three parts (expectation); 5) and the new candidate clusters again yield updated means (maximization). These steps are iterated until the means converge to an optimized approximation.

If you’ve got some time, try finding the optimal k; it’s a fairly complicated problem that easily exceeds the scope of this exercise and requires attention specific to each dataset. The next code clusters the data into 10 groups.
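One common heuristic for choosing k is the elbow method: plot the total within-cluster sum of squares against k and look for the bend. A sketch on toy data standing in for brc (a heuristic, not a definitive recipe):

```r
set.seed(5)
toy <- matrix(rnorm(300), ncol = 3)  # 100 toy points in 3-D
# Total within-cluster SS for k = 1..10; nstart avoids bad random starts
wss <- sapply(1:10, function(k)
  kmeans(toy, centers = k, iter.max = 1000, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

The curve always decreases as k grows; the “elbow” where it flattens is a reasonable k.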

km = kmeans(brc, centers = 10, iter.max = 1000) # Perform K-means.
res = cbind(km$cluster, brc) # Get the results.
res[1:5, 1:5] # View the results.
##          km$cluster        BC1        BC2          BC3         BC4
## MMP11            10  4.2066667  3.5826667  1.415600000  2.25550000
## CDCA2             4 -2.1091667 -1.6561667 -2.623000000 -1.63600000
## ASPM              4 -1.7484545 -2.7056429 -2.551785714 -1.71257143
## TMEM88            7 -0.2888333 -0.5121667 -0.008333333  0.01033333
## KIAA0101          4 -1.7595000 -1.4788000 -1.332100000 -1.42540000

What does km look like?

head(km)
## $cluster
##     MMP11     CDCA2      ASPM    TMEM88  KIAA0101    ZDHHC9      TPM3      PHF6 
##        10         4         4         7         4        10         6         6 
##  TNFRSF18      JPH2      GPD1     MFSD5   ZDHHC24       GSN    ZNF553   FAM128A 
##         6         7         1        10         6         1        10         6 
##  C10orf90      TPX2    MANEAL     UBE2Z     PARP1     ROBO3      FHL1      EGR1 
##         9         4         2         6         2         9         3         1 
##      CKS2  C20orf96     S100B      RDH5    NUP210    IGSF10     CEP55     ANXA1 
##         4         6         3         8         2         5         4         7 
##     ESCO2    TRIM59   LGALS12     RAB26      IRF9     CDCA5       AMT       RGN 
##         4         6         8         6        10         4         9         9 
##      SOD3       MME  HIST1H3J     KCNE1    STARD9     CYR61     STMN1     GLYAT 
##         7         8        10         1         9         8         2         1 
##   PLEKHH2    PPAP2B     CLCN2     PRR15   FGFR1OP   ANKDD1A      ETS2    ABHD11 
##         9         9         6        10         6         9         9        10 
##      CAV1      CES1     SMOC1     CLCN6      AQP7     UBE2T    MBOAT2      PTX3 
##         3         3         5         9         8         4        10         5 
##     TOP2A    TMEM47      PLP1     PTPN6     NACAD     ACADL       NLK    CASP12 
##         4         7         5        10         7         1         6         7 
##    AKR1C1    CLEC4G       TK1     PTGS2      LIPE      FGF7  C18orf45    ENTPD7 
##         3         7         4         5         1         1        10        10 
##    BRI3BP    CORO2B      SORD      RBP4   TP53BP1      GPX3    ZNHIT2     NCBP1 
##         6         9        10         3         6         7        10         6 
##     AVPR2     KALRN    ACVR1C      NEK2    UBAP2L      BUB1  NUDT16L1    MLF1IP 
##         7         9         1         4         6         4        10         2 
##   GPR109B   C13orf3    LRRIQ2      LIG3    FAM70A    CX3CL1  C14orf80    WDR51A 
##         8         4         6         6         9         1         2         2 
##    DOPEY2      NME3   ATP6AP1  PPP1R15A      ATIC    AQP7P2     NR4A1     STRBP 
##        10        10        10         9         6         1         7         6 
##    CLEC3B  C20orf94     SOX12     CDAN1     CXCL3    PLAGL1      SGCG      PRC1 
##         1        10         6         6         5         9         3         4 
##      GPC3  KIAA1683      NTF3     POLA1     TACC3     ZWINT   CNTNAP3      KLF4 
##         5         9         9         6         4         2         5         7 
##    IGFBP6       VIP     DUSP1      HDGF   CCDC88C       HN1     ZBTB9   SLC22A3 
##         9         7         1         2        10         2        10         9 
##     CSTF2   C3orf41 LOC387911     CXCL2     NUAK2       CA4     ZFP36     FXYD1 
##         2         5         1         3         6         3         1         8 
##      PGM5      DPP6   COL10A1     SPRY2     CCBE1     LYVE1    KGFLP1      FOSB 
##         7         9        10         9         9         1         1         1 
##       FOS    EFCBP1     FANCI      IRS2     MKI67     NMUR1     KIF4A     TNNT3 
##         8         5         4         7         4         8         4         9 
##    SLC4A7 
##         2 
## 
## $centers
##             BC1        BC2         BC3         BC4        BC5         BC6
## 1   0.363075000  0.9869296  0.23614907  0.87863565  1.4158222  1.23847828
## 2  -0.724200000 -0.7835421 -0.94208102 -0.81419630 -1.5894458 -1.06205919
## 3  -2.204748148 -2.5277944 -3.06008148 -2.16313889 -0.9402185 -1.08809074
## 4  -1.852155502 -2.0983106 -2.10368331 -1.63737203 -3.7455969 -2.62728063
## 5  -3.377546591 -4.0980575 -3.95562440 -3.17998167 -2.2623850 -2.90449920
## 6  -0.007212784  0.1894422  0.07867242 -0.03569387 -0.4356375 -0.12348621
## 7  -0.110435000 -0.2310828 -0.45856778 -0.13049589  0.3694478 -0.01652166
## 8  -0.481019444 -0.3413380 -1.43108333 -0.96151852  0.4436991  0.49738884
## 9  -1.326689406 -1.2134337 -1.66722945 -1.16984229 -0.2185684 -0.88804788
## 10  1.513416099  1.5491164  1.17330046  1.31984851  0.7771399  1.16591625
##           BC7         BC8        BC9        BC10        NB1         NB2
## 1   2.3754676  0.77838474  0.3664718  0.63567136  3.7768426  3.67110928
## 2  -1.0056667 -0.87194581 -0.5976787 -0.45527542 -1.9828282 -2.04766867
## 3  -0.5671241 -2.30486122 -2.3545204 -2.24516844  1.0173500  1.57618886
## 4  -2.0911202 -2.07391503 -1.9828224 -1.16445242 -3.9672320 -4.07285784
## 5  -2.4772642 -3.79835630 -3.9719008 -3.87228500 -0.9562992 -0.46185125
## 6  -0.1178435  0.01534143  0.1339313 -0.09410268 -1.3201885 -1.18306411
## 7   0.6450744 -0.00328444 -0.2867215 -0.28392443  1.7280300  1.82731847
## 8   1.4509954 -0.18187037 -0.6460648 -0.61855556  2.4313009  2.95579633
## 9  -0.8273715 -1.26545330 -1.4766850 -1.53029277  0.4324883  0.71435088
## 10  1.1597494  1.41231046  1.6016769  1.10991135 -0.1328168 -0.02361199
##           NB3        NB4        NB5          NB6        NB7        NB8
## 1   4.1138588  4.1067658  4.0031185  2.482437389  4.3094292  2.6657120
## 2  -2.4380829 -2.6757083 -2.1182782 -1.990384667 -2.7001051 -1.9813028
## 3   1.9871889  2.6507316  2.4749870  0.252229633  2.0645222  0.3792880
## 4  -4.6324932 -5.0340579 -3.8072554 -4.215714053 -4.2643911 -2.6756837
## 5  -0.8652204 -0.7564237 -0.6359271 -1.474255000 -0.6872667 -0.9039658
## 6  -1.1153377 -1.1649309 -1.1552762 -0.981594208 -1.5380764 -1.3620886
## 7   1.8415822  1.7928395  1.8185856  1.048155560  2.0923167  2.1288344
## 8   3.2745370  4.0663658  3.2179352  1.535680667  3.6766250  1.5937176
## 9   0.7576912  0.5998011  0.5416594  0.354678565  0.7925300  0.3025057
## 10 -0.1412604 -0.2253308  0.1417405  0.003561614 -0.5913799 -0.3147723
##           NB9       NB10
## 1   4.3920282  3.0184356
## 2  -2.6363352 -2.1397810
## 3   3.7779037  0.7588093
## 4  -4.8524136 -4.6368213
## 5  -1.1419429 -1.1497679
## 6  -1.3570257 -0.9942895
## 7   2.4197939  1.2651672
## 8   3.9657176  2.2765556
## 9   0.8723458  0.3607403
## 10 -0.4746480  0.1975832
## 
## $totss
## [1] 13772.63
## 
## $withinss
##  [1] 291.90953  78.08482 161.22155 170.15470 237.03222 140.45566 135.53206
##  [8]  88.60056 190.42980 409.19959
## 
## $tot.withinss
## [1] 1902.62
## 
## $betweenss
## [1] 11870.01

write.table(res, 'kmeans.txt', sep =" ", row.names = TRUE, col.names=TRUE) # Make result file

km$cluster["MMP11"]
## MMP11 
##    10

In this run, the MMP11 gene belongs to cluster 10 (cluster labels are arbitrary and may differ between runs).

One can visualize how genes in the same cluster exhibit expression patterns that differ between the breast cancer patient group and the normal group.

brc’s columns 1 through 10 correspond to breast cancer samples and 11 through 20 correspond to normal samples.

c = apply(brc[,1:10], 1, mean) # Mean expression level in the breast cancer patient group
# for brc[, 1:10], row-wise, perform mean for each gene.
n = apply(brc[,11:20], 1, mean) # Mean expression level in the normal group
# for brc[, 11:20], row-wise, perform mean for each gene.
a = data.frame(c, n) # Data format conversion
cc = rainbow(10)[as.factor(km$cluster)] # Color coordination for each gene clusters.
plot(a, col=cc, pch=8) # pch=8 designates point symbol star.

Refer to this article for point symbol designation.

cc
##   [1] "#FF0099FF" "#33FF00FF" "#33FF00FF" "#0066FFFF" "#33FF00FF" "#FF0099FF"
##   [7] "#00FFFFFF" "#00FFFFFF" "#00FFFFFF" "#0066FFFF" "#FF0000FF" "#FF0099FF"
##  [13] "#00FFFFFF" "#FF0000FF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF" "#33FF00FF"
##  [19] "#FF9900FF" "#00FFFFFF" "#FF9900FF" "#CC00FFFF" "#CCFF00FF" "#FF0000FF"
##  [25] "#33FF00FF" "#00FFFFFF" "#CCFF00FF" "#3300FFFF" "#FF9900FF" "#00FF66FF"
##  [31] "#33FF00FF" "#0066FFFF" "#33FF00FF" "#00FFFFFF" "#3300FFFF" "#00FFFFFF"
##  [37] "#FF0099FF" "#33FF00FF" "#CC00FFFF" "#CC00FFFF" "#0066FFFF" "#3300FFFF"
##  [43] "#FF0099FF" "#FF0000FF" "#CC00FFFF" "#3300FFFF" "#FF9900FF" "#FF0000FF"
##  [49] "#CC00FFFF" "#CC00FFFF" "#00FFFFFF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF"
##  [55] "#CC00FFFF" "#FF0099FF" "#CCFF00FF" "#CCFF00FF" "#00FF66FF" "#CC00FFFF"
##  [61] "#3300FFFF" "#33FF00FF" "#FF0099FF" "#00FF66FF" "#33FF00FF" "#0066FFFF"
##  [67] "#00FF66FF" "#FF0099FF" "#0066FFFF" "#FF0000FF" "#00FFFFFF" "#0066FFFF"
##  [73] "#CCFF00FF" "#0066FFFF" "#33FF00FF" "#00FF66FF" "#FF0000FF" "#FF0000FF"
##  [79] "#FF0099FF" "#FF0099FF" "#00FFFFFF" "#CC00FFFF" "#FF0099FF" "#CCFF00FF"
##  [85] "#00FFFFFF" "#0066FFFF" "#FF0099FF" "#00FFFFFF" "#0066FFFF" "#CC00FFFF"
##  [91] "#FF0000FF" "#33FF00FF" "#00FFFFFF" "#33FF00FF" "#FF0099FF" "#FF9900FF"
##  [97] "#3300FFFF" "#33FF00FF" "#00FFFFFF" "#00FFFFFF" "#CC00FFFF" "#FF0000FF"
## [103] "#FF9900FF" "#FF9900FF" "#FF0099FF" "#FF0099FF" "#FF0099FF" "#CC00FFFF"
## [109] "#00FFFFFF" "#FF0000FF" "#0066FFFF" "#00FFFFFF" "#FF0000FF" "#FF0099FF"
## [115] "#00FFFFFF" "#00FFFFFF" "#00FF66FF" "#CC00FFFF" "#CCFF00FF" "#33FF00FF"
## [121] "#00FF66FF" "#CC00FFFF" "#CC00FFFF" "#00FFFFFF" "#33FF00FF" "#FF9900FF"
## [127] "#00FF66FF" "#0066FFFF" "#CC00FFFF" "#0066FFFF" "#FF0000FF" "#FF9900FF"
## [133] "#FF0099FF" "#FF9900FF" "#FF0099FF" "#CC00FFFF" "#FF9900FF" "#00FF66FF"
## [139] "#FF0000FF" "#CCFF00FF" "#00FFFFFF" "#CCFF00FF" "#FF0000FF" "#3300FFFF"
## [145] "#0066FFFF" "#CC00FFFF" "#FF0099FF" "#CC00FFFF" "#CC00FFFF" "#FF0000FF"
## [151] "#FF0000FF" "#FF0000FF" "#3300FFFF" "#00FF66FF" "#33FF00FF" "#0066FFFF"
## [157] "#33FF00FF" "#3300FFFF" "#33FF00FF" "#CC00FFFF" "#FF9900FF"

It seems the clusters can be divided into five groups that are more inclined to be expressed in breast cancer and five groups that tend to be expressed in normal tissue. Therefore, it seems feasible to name the two big clusters the “breast cancer cluster” and the “normal cluster”.

Use ggplot for better visualization.

p = ggplot(a, aes(x=c, y=n, color=as.factor(km$cluster)))
p = p + geom_point(shape=8)
p = p + labs(colour = "gene cluster")
p

Better aesthetics with ggplot.

2-4. Self-Organizing Map (SOM)

The self-organizing map (SOM) is a clustering analysis based on neural network models. k-means clustering only divides the dataset into k clusters and says nothing about the relationships among the clusters. But as we’ve already seen in the example above, regardless of the arbitrary cluster number k we designate, there may be a more apparent underlying pattern among the clusters: there were k=10 clusters, yet they could also be divided into two big clusters corresponding to a ‘normal cluster’ and a ‘breast cancer cluster’. A self-organizing map retains the lattice structure of the neural network while performing clustering, so we can observe the clustering itself while retaining information about the relationships between clusters.

Perform SOM analysis of 3 x 3 lattice structure.

sm = som(brc, xdim = 3, ydim = 3)
plot(sm)

The resulting figure is the SOM clustering result. Each lattice cell is one cluster, and neighboring clusters are more similar to each other than non-adjacent clusters. The number at the top of each cell is the number of genes in that cluster. In total, the 159 genes have been clustered into 7 occupied cells, each containing 8 to 40 genes. The graph in each cell shows the per-sample expression profiles of the genes in that cluster; one can observe that neighboring cells have similar-looking expression profiles.
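To see which lattice cell each gene landed in, the object returned by som exposes (as I understand the som package) a `visual` component giving the winning cell’s x/y grid coordinates for each input row. A hedged sketch on toy data standing in for brc:

```r
library(som)
set.seed(9)
toy <- matrix(rnorm(50 * 5), nrow = 50)  # 50 toy "genes" x 5 samples
sm <- som(toy, xdim = 3, ydim = 3)       # same call shape as the exercise
# sm$visual maps each input row to its winning x/y lattice cell
cell <- paste(sm$visual$x, sm$visual$y, sep = ",")
table(cell)                              # genes per lattice cell
```

With the real data, `paste(sm$visual$x, sm$visual$y)` would list the cluster membership of each of the 159 genes.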

3. eQTL analysis resources

The NCBI GTEx (Genotype-Tissue Expression) eQTL Browser and seeQTL (a searchable human eQTL browser and database) are among the most widely used eQTL databases. GTEx provides convenient filtered searches by gene ID, dbSNP ID, phenotype traits, etc., which users can combine into a variety of queries.

3-1. GTEx eQTL Browser

Let’s go to NCBI GTEx and search for eQTLs of TNNT3, one of the differentially expressed genes in breast cancer.

If you click on IGV Browser in Actions on the right-most column, you can access GTEx IGV Browser.

Then go to Show Track Menu and select Breast-Mammary Tissue.

Grey dots correspond to all statistically significant gene-SNP pairs in a cis-eQTL relationship. Red dots show the variants associated with TNNT3 gene expression at an FDR of less than 0.05.

You can see that rs2734495 is a significant eQTL to genes TNNT3 and MRPL23 in breast mammary tissue.